Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Logging]metrics ops logs include event loop delay histogram data #120451

Conversation

TinaHeiligers
Copy link
Contributor

@TinaHeiligers TinaHeiligers commented Dec 5, 2021

Resolves #119069

  • Adds 50th, 95th & 99th percentiles (in ms) of the event loop delay histogram to metrics.ops logs
  • Ensures the event loop delay histogram data is provided by the /api/status API to allow Monitoring UI to render more detailed metrics.

Logs

debug logs

  • pattern layout: "[%date][%level][%logger][%meta]---%message"
[2021-12-05T14:02:52.138-07:00][DEBUG][metrics.ops][{"event":{"kind":"metric","category":["process","host"],"type":["info"]},"process":{"uptime":37,"memory":{"heap":{"usedInBytes":494344840}},"eventLoopDelay":11.621535925407926,"eventLoopDelayHistogram":{"50":11.550719,"95":12.623871,"99":12.664831}},"host":{"os":{"load":{"1m":3.7490234375,"5m":8.82666015625,"15m":8.89697265625}}}}]---memory: 471.4MB uptime: 0:00:37 load: [3.75,8.83,8.90] mean delay: 11.622 delay histogram: { 50: 11.551; 95: 12.624; 99: 12.665 }
{"event":{"kind":"metric","category":["process","host"],"type":["info"]},"process":{"uptime":37,"memory":{"heap":{"usedInBytes":494344840}},"eventLoopDelay":11.621535925407926,"eventLoopDelayHistogram":{"50":11.550719,"95":12.623871,"99":12.664831},"pid":4989},"host":{"os":{"load":{"1m":3.7490234375,"5m":8.82666015625,"15m":8.89697265625}}},"ecs":{"version":"8.0.0"},"@timestamp":"2021-12-05T14:02:52.138-07:00","message":"memory: 471.4MB uptime: 0:00:37 load: [3.75,8.83,8.90] mean delay: 11.622 delay histogram: { 50: 11.551; 95: 12.624; 99: 12.665 }","log":{"level":"DEBUG","logger":"metrics.ops"},"trace":{"id":"8bcb3247c71a5ae90736d39169f9ab91"},"transaction":{"id":"9b3566667b7effed"}}
  • json layout (also shown prettified below):
{"event":{"kind":"metric","category":["process","host"],"type":["info"]},"process":{"uptime":146,"memory":{"heap":{"usedInBytes":515810168}},"eventLoopDelay":11.739656432941176,"eventLoopDelayHistogram":{"50":11.829247,"95":12.615679,"99":12.664831},"pid":5399},"host":{"os":{"load":{"1m":1.939453125,"5m":2.697265625,"15m":4.89892578125}}},"ecs":{"version":"8.0.0"},"@timestamp":"2021-12-05T14:15:53.772-07:00","message":"memory: 491.9MB uptime: 0:02:26 load: [1.94,2.70,4.90] mean delay: 11.740 delay histogram: { 50: 11.829; 95: 12.616; 99: 12.665 }","log":{"level":"DEBUG","logger":"metrics.ops"},"trace":{"id":"cd6d60eb7fcc200e8d0915b2e91c914e"},"transaction":{"id":"e68d05db97f9244c"}}
json layout prettified
{
  "event": {
    "kind": "metric",
    "category": [
      "process",
      "host"
    ],
    "type": [
      "info"
    ]
  },
  "process": {
    "uptime": 146,
    "memory": {
      "heap": {
        "usedInBytes": 515810168
      }
    },
    "eventLoopDelay": 11.739656432941176,
    "eventLoopDelayHistogram": {
      "50": 11.829247,
      "95": 12.615679,
      "99": 12.664831
    },
    "pid": 5399
  },
  "host": {...}
  },
  "ecs": {
    "version": "8.0.0"
  },
  "@timestamp": "2021-12-05T14:15:53.772-07:00",
  "message": "memory: 491.9MB uptime: 0:02:26 load: [1.94,2.70,4.90] mean delay: 11.740 delay histogram: { 50: 11.829; 95: 12.616; 99: 12.665 }",
  "log": {
    "level": "DEBUG",
    "logger": "metrics.ops"
  },
  "trace": {
    "id": "cd6d60eb7fcc200e8d0915b2e91c914e"
  },
  "transaction": {
    "id": "e68d05db97f9244c"
  }
}

/api/status

network response body
{
  "name": "christianes-mbp.lan",
  "uuid": "5b2de169-2785-441b-ae8c-186a1936b17d",
  "version": {
    "number": "8.1.0",
    "build_hash": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "build_number": 9007199254740991,
    "build_snapshot": false
  },
  "status": {...},
  "metrics": {
    "last_updated": "2021-12-05T21:38:23.532Z",
    "collection_interval_in_millis": 5000,
    "os": {...},
    "process": {
      "memory": {...},
      "pid": 5830,
      "event_loop_delay": 12.21699262745098,
      "event_loop_delay_histogram": {
        "min": 9.076736,
        "max": 70.385663,
        "mean": 12.21699262745098,
        "exceeds": 0,
        "stddev": 5.6599237565600395,
        "fromTimestamp": "2021-12-05T21:38:18.533Z",
        "lastUpdatedAt": "2021-12-05T21:38:23.528Z",
        "percentiles": {
          "50": 11.116543,
          "75": 12.599295,
          "95": 16.793599,
          "99": 41.287679
        }
      },
      "uptime_in_millis": 966734.90539
    },
    "processes": [
      {
        "memory": {...},
        "pid": 5830,
        "event_loop_delay": 12.21699262745098,
        "event_loop_delay_histogram": {
          "min": 9.076736,
          "max": 70.385663,
          "mean": 12.21699262745098,
          "exceeds": 0,
          "stddev": 5.6599237565600395,
          "fromTimestamp": "2021-12-05T21:38:18.533Z",
          "lastUpdatedAt": "2021-12-05T21:38:23.528Z",
          "percentiles": {
            "50": 11.116543,
            "75": 12.599295,
            "95": 16.793599,
            "99": 41.287679
          }
        },
        "uptime_in_millis": 966734.90539
      }
    ],
    "response_times": {...},
    "concurrent_connections": 1,
    "requests": {...},
      "status_codes": {
        "200": 1,
        "304": 122
      }
    }
  }
}

Checklist

Delete any items that are not applicable to this PR.

How to test this:

Pull this branch and run kibana with trial license and the following logging configuration:

// in kibana.yml
logging:
  appenders:
    console:
      type: console
      layout:
        type: pattern
        highlight: true
        pattern: "[%date][%level][%logger][%meta]---%message"
  root:
    appenders: [console]
    level: warn
  loggers:
    - name: metrics.ops
      appenders: [console]
      level: debug
    - name: monitoring
      level: debug
      appenders: [console]

After kibana starts up, wait a few moments to see the output of the metrics.ops logs. Observe that the eventLoopDelayHistogram appears in the log message and under the process field in the log meta.

Navigate to the status page: localhost:5601/<basepath>/status.
Open the browsers' dev tools and inspect the network response from the status api endpoint. Observe that the eventLoopDelay histogram percentiles appears under the metric/process and metric/processes keys.

Risk Matrix

Risk Probability Severity Mitigation/Notes
Ingesting metrics ops logs and relying on the metrics.ops message to parse data is risky as the message string has changed Low Low Relying on a particular log message is flaky, use the log meta instead. This PR changes the wording, and adds another entry.

Release note

Adds 50th, 95th and 99th event loop delay histogram percentiles to metrics.ops logger message and meta field surfaced when logging in debug mode or in json layout.

@elastic elastic deleted a comment from kibana-ci Dec 5, 2021
const logMeta = getEcsOpsMetricsLog(createBaseOpsMetrics());
expect(logMeta).toMatchInlineSnapshot(`
const logMeta = getEcsOpsMetricsLog(createMockOpsMetrics(testMetrics));
expect(logMeta.meta).toMatchInlineSnapshot(`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snapshot was getting too big to easily notice a change, so I've changed this assertion to only test the meta field.

const eventLoopDelayVal = process?.event_loop_delay;
const eventLoopDelayValMsg = eventLoopDelayVal
? `delay: ${numeral(process?.event_loop_delay).format('0.000')}`
? `mean delay: ${numeral(process?.event_loop_delay).format('0.000')}`
Copy link
Contributor Author

@TinaHeiligers TinaHeiligers Dec 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have the histogram percentiles in the log message, we need to let folks know that the first delay entry is the mean.

Copy link
Member

@lukeelmers lukeelmers Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wonder if it would be more readable to combine the values in the existing delay, similar to how it is structured in the API?

"memory: 100.0B uptime: 0:00:01 load: [10.00,20.00,30.00] delay: { mean: 50.000, 50: 50.000, 95: 95.000, 99: 99.000 }"

(Edit: Mostly just trying to think how we can keep this as succinct as possible)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be more readable to combine the values in the existing delay, similar to how it is structured in the API.

We could but that would be a breaking change for folks ingesting the log message. ATM, if someone was parsing the message string they might expect delay: number and end up with delay: [Object].

Admittedly, this PR also changes the message string but keeps the delay type the same. Appending the additional data seems like a less risky option here.

Admittedly, ingesting log message data isn't the best approach and folks should be using the meta instead. However, we removed meta from all the logs expect for debug mode so the extra info wouldn't be available for other levels.

Copy link
Member

@lukeelmers lukeelmers Dec 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could but that would be a breaking change for folks ingesting the log message.

Yeah, although I suppose we are adding mean delay so a grok pattern against just delay could match two results now (both mean and histogram). EDIT: Actually I guess not -- I misread, we have mean delay and delay histogram 🙂

However, in the past we have not considered the message format to be stable, and have made breaking changes to it in minors anyway. So I am not super concerned about this either way.

Copy link
Contributor Author

@TinaHeiligers TinaHeiligers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self review

@TinaHeiligers TinaHeiligers added the backport:skip This commit does not require backporting label Dec 5, 2021
@TinaHeiligers TinaHeiligers marked this pull request as ready for review December 5, 2021 22:31
@TinaHeiligers TinaHeiligers requested a review from a team as a code owner December 5, 2021 22:31
@TinaHeiligers TinaHeiligers added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Dec 6, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

Copy link
Contributor

@pgayvallet pgayvallet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Are we planning on updating the /status page to display these additional values btw?

@@ -75,7 +95,7 @@ export function getEcsOpsMetricsLog(metrics: OpsMetrics) {
};

return {
message: `${processMemoryUsedInBytesMsg}${uptimeValMsg}${loadValsMsg}${eventLoopDelayValMsg}`,
message: `${processMemoryUsedInBytesMsg}${uptimeValMsg}${loadValsMsg}${eventLoopDelayValMsg}${eventLoopDelayHistMsg}`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT/optional: I wonder if

message: [
  processMemoryUsedInBytesMsg,
  uptimeValMsg,
  loadValsMsg,
  eventLoopDelayValMsg,
  eventLoopDelayHistMsg,
].join('');

isn't more readable

Copy link
Contributor Author

@TinaHeiligers TinaHeiligers Dec 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we planning on updating the /status page to display these additional values btw?

It's not in the scope for this issue but I don't see why not!
How would we display the data? We could use the same cards we use now and add each entry individually or perhaps we could add another level to the metrics cards with details that would expand if a "show more" or "details" was clicked?
Maybe a simpler option would be to only show say the 50th, 95th and 99th percentiles in the same delay card, at the bottom in smaller text?

Update: I created #120667 for discussion on the approach.

Copy link
Member

@lukeelmers lukeelmers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we planning on updating the /status page to display these additional values btw?

++ This would be a nice addition

const eventLoopDelayVal = process?.event_loop_delay;
const eventLoopDelayValMsg = eventLoopDelayVal
? `delay: ${numeral(process?.event_loop_delay).format('0.000')}`
? `mean delay: ${numeral(process?.event_loop_delay).format('0.000')}`
Copy link
Member

@lukeelmers lukeelmers Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wonder if it would be more readable to combine the values in the existing delay, similar to how it is structured in the API?

"memory: 100.0B uptime: 0:00:01 load: [10.00,20.00,30.00] delay: { mean: 50.000, 50: 50.000, 95: 95.000, 99: 99.000 }"

(Edit: Mostly just trying to think how we can keep this as succinct as possible)

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@TinaHeiligers TinaHeiligers merged commit 80dc3b6 into elastic:main Dec 7, 2021
@TinaHeiligers TinaHeiligers deleted the improve-event-loop-delay-representation branch December 7, 2021 20:04
@TinaHeiligers TinaHeiligers added v7.16.1 v7.16.2 v7.17.0 auto-backport Deprecated - use backport:version if exact versions are needed and removed backport:skip This commit does not require backporting labels Dec 8, 2021
@kibanamachine
Copy link
Contributor

💔 Backport failed

Status Branch Result
7.16
7.17 The branch "7.17" is invalid or doesn't exist

Successful backport PRs will be merged automatically after passing CI.

To backport manually run:
node scripts/backport --pr 120451

@TinaHeiligers TinaHeiligers restored the improve-event-loop-delay-representation branch December 8, 2021 15:09
@kibanamachine
Copy link
Contributor

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 120451 or prevent reminders by adding the backport:skip label.

@kibanamachine kibanamachine added the backport missing Added to PRs automatically when the are determined to be missing a backport. label Dec 9, 2021
@kibanamachine
Copy link
Contributor

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 120451 or prevent reminders by adding the backport:skip label.

1 similar comment
@kibanamachine
Copy link
Contributor

Friendly reminder: Looks like this PR hasn’t been backported yet.
To create backports run node scripts/backport --pr 120451 or prevent reminders by adding the backport:skip label.

@lukeelmers lukeelmers added v8.0.0 auto-backport Deprecated - use backport:version if exact versions are needed and removed backport missing Added to PRs automatically when the are determined to be missing a backport. auto-backport Deprecated - use backport:version if exact versions are needed labels Dec 13, 2021
@kibanamachine
Copy link
Contributor

💔 Backport failed

Status Branch Result
8.0
7.17 The branch "7.17" is invalid or doesn't exist

Successful backport PRs will be merged automatically after passing CI.

To backport manually run:
node scripts/backport --pr 120451

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed Feature:Logging release_note:enhancement Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.17.0 v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improving event loop delay representation
6 participants